Visualization and Analysis of Gun Violence in the United States

Final Tutorial CMSC320 - Apoorv Bansal, Shubhankar Sachdev, Andrew Huang

Introduction

Gun violence is one of the most prominent issues in our nation today. Debates surrounding gun regulation and gun laws dominate our political atmosphere, and it is one of many topics that severely divide the citizens of the USA. In recent years, the media has focused more and more on incidents of gun violence such as mass shootings in public places like schools and houses of worship. The goal of this tutorial is to explore the issue of gun violence in America and use the data science pipeline to help inform potential policy decisions that could minimize this problem in the future.

Required Libraries and Modules

In [32]:
!pip install folium
import re
import numpy as np
import matplotlib.pyplot as plt
import folium
import requests
from folium.plugins import MarkerCluster
from sklearn import datasets
from sklearn.datasets import load_iris
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression
from sklearn import linear_model
import pandas as pd
import math
import seaborn as sns
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score, KFold
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from scipy import stats
import random

1. Data Collection and Tidying

The first two stages of the data science pipeline are the collection and tidying of data. Normally these are considered two separate steps; however, our collection was relatively simple since we downloaded our data directly from the source, so we combined collection and tidying into one step.

Our collection method was to download the dataset from Kaggle, a community of data scientists where users can find and share various datasets. Once we downloaded the dataset, we loaded it into a Pandas DataFrame. A DataFrame is a table-like data structure organized by rows and columns. DataFrames are extremely useful because reading the data is significantly easier than with other data structures, and Pandas provides many functions for performing complex operations and manipulations on them. More information about DataFrames can be found at DataFrames.
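As a quick illustration (a toy example with made-up numbers, not part of the tutorial's dataset), a DataFrame can be built and queried like this:

```python
import pandas as pd

# Hypothetical miniature table in the same spirit as the gun violence data
df = pd.DataFrame({
    "state": ["Ohio", "Texas", "Ohio"],
    "n_killed": [1, 2, 0],
    "n_injured": [3, 0, 2],
})

# Column arithmetic works elementwise, just like the 'harmed' column built later
df["harmed"] = df["n_killed"] + df["n_injured"]
print(df["harmed"].tolist())  # [4, 2, 2]
```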

The next phase was the tidying of data. We began by dropping columns, such as those storing various URLs, that we knew would not be relevant to our analysis.

Next, we wanted to make some of the columns more readable. The first thing we did was clean all of the lists. In the original dataset, a cell with a list was stored as a digit indicating an index, followed by ::, followed by the data at that index. We wrote a simple function that cleans a string in this format and converts it to an actual list. We ran this on every column with a list in the untidy format, which made the data significantly more readable. Additionally, we simplified the "gun_type" column to reflect whether the gun was a pistol/shotgun or an automatic weapon, rather than a specific gun model. This makes it easier for readers who are not knowledgeable about guns to determine the type of weapon used.
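To make the untidy format concrete, here is a standalone sketch (with a made-up cell value) of the transformation we apply:

```python
import re

# Hypothetical raw cell value: 'index::value' entries joined by '||'
raw = "0::Male||1::Female||2::Male"

# Strip each 'digit::' prefix to recover a plain list
cleaned = [re.split(r"\d+::", part)[-1] for part in raw.split("||")]
print(cleaned)  # ['Male', 'Female', 'Male']
```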

In [33]:
data = pd.read_csv('gun_violence.csv')

# Let's begin to tidy the data; going through the columns, there are some we can drop
data = data.drop(columns = ['gun_stolen', 'participant_name', 'incident_url', 'source_url', 'incident_url_fields_missing', 'location_description', 'state_house_district', 'state_senate_district', 'sources'])
# After dropping some columns, let's make the data a bit more usable, starting with participant_age.
# Each entry is stored as (int)::Age, with multiple participants joined by '||',
# so we strip the numeric prefix from each piece.
# Create a for loop to go through each row.
# Different gun types: Handgun, AR-15, AK-47, shotgun, auto rifle

count = 0
# Returns a cleaned list from the messy input of a digit followed by ::
def clean_str(strIn):
    strList = []
    for piece in strIn.split('||'):
        # raw string so the \d escape is not mangled; \d+ also handles indices >= 10
        regex = re.split(r'\d+::', piece)
        strList.append(regex[-1])
    return strList
for i, r in data.iterrows():
    raw_age = str(r['participant_age'])
    raw_gender = str(r['participant_gender'])
    raw_ageGroup = str(r['participant_age_group'])
    type_shooting = str(r['incident_characteristics'])
    raw_status = str(r['participant_status'])
    raw_participant_type = str(r['participant_type'])
    gun_used = str(r['gun_type'])
    if 'AK-47' in gun_used or 'AR-15' in gun_used or 'Auto' in gun_used:
        data.at[count, 'gun_type'] = 'Automatic Gun Used'
    else:
        data.at[count, 'gun_type'] = 'Pistol/Shotgun'
    g_list = clean_str(raw_gender)
    data.at[count, 'participant_gender'] = g_list
    group_list = clean_str(raw_ageGroup)
    data.at[count, 'participant_age_group'] = group_list
    clean_part_status = clean_str(raw_status)
    clean_part_type = clean_str(raw_participant_type)
    data.at[count, 'participant_status'] = clean_part_status
    data.at[count, 'participant_type'] = clean_part_type
    if 'Mass Shooting' in type_shooting:
        data.at[count, 'incident_characteristics'] = 'Mass Shooting(4+ Deaths/Injuries)'
    else:
        data.at[count, 'incident_characteristics'] = 'Isolated Shooting(0-3 Deaths/Injuries)'
        
    # Ages appear in the same (int)::Age format, but they are not always two
    # digits long (young children appear in the data), so we reuse clean_str
    # rather than slicing the last two characters of each entry.
    data.at[count, 'participant_age'] = clean_str(raw_age)
    count = count + 1
    
    
    
    
# 37.8, -96], 4 coords for folium map 
data['year'] = data.date.str.extract(r'([0-9][0-9][0-9][0-9])', expand=True)
data["year"] = pd.to_numeric(data["year"])
data['harmed'] = data['n_killed'] + data['n_injured']
data.head()
Out[33]:
incident_id date state city_or_county address n_killed n_injured congressional_district gun_type incident_characteristics ... n_guns_involved notes participant_age participant_age_group participant_gender participant_relationship participant_status participant_type year harmed
0 461105 2013-01-01 Pennsylvania Mckeesport 1506 Versailles Avenue and Coursin Street 0 4 14.0 Pistol/Shotgun Mass Shooting(4+ Deaths/Injuries) ... NaN Julian Sims under investigation: Four Shot and... [20] [Adult 18+, Adult 18+, Adult 18+, Adult 18+, A... [Male, Male, Male, Female] NaN [Arrested, Injured, Injured, Injured, Injured] [Victim, Victim, Victim, Victim, Subject-Suspect] 2013 4
1 460726 2013-01-01 California Hawthorne 13500 block of Cerise Avenue 1 3 43.0 Pistol/Shotgun Mass Shooting(4+ Deaths/Injuries) ... NaN Four Shot; One Killed; Unidentified shooter in... [20] [Adult 18+, Adult 18+, Adult 18+, Adult 18+] [Male] NaN [Killed, Injured, Injured, Injured] [Victim, Victim, Victim, Victim, Subject-Suspect] 2013 4
2 478855 2013-01-01 Ohio Lorain 1776 East 28th Street 1 3 9.0 Pistol/Shotgun Isolated Shooting(0-3 Deaths/Injuries ... 2.0 NaN [25, 31, 33, 34, 33] [Adult 18+, Adult 18+, Adult 18+, Adult 18+, A... [Male, Male, Male, Male, Male] NaN [Injured, Unharmed, Arrested, Unharmed, Arrest... [Subject-Suspect, Subject-Suspect, Victim, Vic... 2013 4
3 478925 2013-01-05 Colorado Aurora 16000 block of East Ithaca Place 4 0 6.0 Pistol/Shotgun Isolated Shooting(0-3 Deaths/Injuries ... NaN NaN [29, 33, 56, 33] [Adult 18+, Adult 18+, Adult 18+, Adult 18+] [Female, Male, Male, Male] NaN [Killed, Killed, Killed, Killed] [Victim, Victim, Victim, Subject-Suspect] 2013 4
4 478959 2013-01-07 North Carolina Greensboro 307 Mourning Dove Terrace 2 2 6.0 Pistol/Shotgun Isolated Shooting(0-3 Deaths/Injuries ... 2.0 Two firearms recovered. (Attempted) murder sui... [18, 46, 14, 47] [Adult 18+, Adult 18+, Teen 12-17, Adult 18+] [Female, Male, Male, Female] 3::Family [Injured, Injured, Killed, Killed] [Victim, Victim, Victim, Subject-Suspect] 2013 4

5 rows × 22 columns

2. Data Visualization

The next stage in the data science life cycle is the data visualization stage where we take our tidied data, and as the name implies, create visual elements such as graphs, maps, etc. to more easily and clearly see trends and patterns in our data.

Interactive Map

Our first piece of visualization is an interactive map of shootings across the country. We utilized the folium library, which specializes in creating maps: it makes building maps for visualization incredibly easy and has many features such as marker clustering and easy zooming. More information about folium can be found at Folium.

Our interactive map will help us see how shooting incidents are spread out across the United States. The clusters will show us where incidents are concentrated and will allow us to zoom into regions for additional focus.

In [34]:
# Create an interactive map of shootings across each state, plot a sample since 270,000 points is too many.
# First make a dictionary of the states, then add the coordinates and the type of shooting, deaths and injured. 
state_shootings = {}
for i, r in data.iterrows():
    # some rows are missing lat/long values; skip those rows for the interactive map.
    if not math.isnan(r['latitude']) and not math.isnan(r['longitude']):
        if r['state'] not in state_shootings:
            # create a tuple with the following format : (lat, long, city, killed, injured)
            state_shootings[r['state']] = [(r['latitude'], r['longitude'], r['city_or_county'], r['n_killed'], r['n_injured'])]
        else:
            state_shootings[r['state']].append((r['latitude'], r['longitude'], r['city_or_county'], r['n_killed'], r['n_injured']))

            
# Next we will take a sample of these tuples from each state and put them in a new dictionary.
state_shootings_sample = {}
for i in state_shootings:
    # cap each state at 500 points so the map stays responsive
    sample_size = min(500, len(state_shootings[i]))
    state_shootings_sample[i] = random.sample(state_shootings[i], sample_size)
    
map_osm = folium.Map(location=[37.8, -102], zoom_start=4)
# allows our map to cluster points for a clean visual of the map. 
mc = MarkerCluster().add_to(map_osm)
for i in state_shootings_sample:
    for j in state_shootings_sample[i]:
        lat = j[0]
        long = j[1]
        # String for popup message when clicking a point. 
        popupStr = 'Town:' + str(j[2]) + '|| Death:' + str(j[3]) + '|| Injured:' + str(j[4])
        mc.add_child(folium.CircleMarker(location=[lat, long], radius = 10, popup = popupStr, color = '#DC143C', fill_color = '#DC143C'))
map_osm.add_child(mc)
map_osm
Out[34]:

Heatmap

Our next piece of data visualization is a choropleth, often loosely called a heat map. This provides slightly different insight from the interactive map: a state-by-state view of where gun deaths are concentrated, giving a quick and easy sense of which states tend to have more shooting deaths.

In [35]:
# We will start making the heat map of each state by first converting the state to the abbreviated form for folium. 
us_state_abbrev = {
    'Alabama': 'AL',
    'Alaska': 'AK',
    'Arizona': 'AZ',
    'Arkansas': 'AR',
    'California': 'CA',
    'Colorado': 'CO',
    'Connecticut': 'CT',
    'Delaware': 'DE',
    'Florida': 'FL',
    'Georgia': 'GA',
    'Hawaii': 'HI',
    'Idaho': 'ID',
    'Illinois': 'IL',
    'Indiana': 'IN',
    'Iowa': 'IA',
    'Kansas': 'KS',
    'Kentucky': 'KY',
    'Louisiana': 'LA',
    'Maine': 'ME',
    'Maryland': 'MD',
    'Massachusetts': 'MA',
    'Michigan': 'MI',
    'Minnesota': 'MN',
    'Mississippi': 'MS',
    'Missouri': 'MO',
    'Montana': 'MT',
    'Nebraska': 'NE',
    'Nevada': 'NV',
    'New Hampshire': 'NH',
    'New Jersey': 'NJ',
    'New Mexico': 'NM',
    'New York': 'NY',
    'North Carolina': 'NC',
    'North Dakota': 'ND',
    'Ohio': 'OH',
    'Oklahoma': 'OK',
    'Oregon': 'OR',
    'Pennsylvania': 'PA',
    'Rhode Island': 'RI',
    'South Carolina': 'SC',
    'South Dakota': 'SD',
    'Tennessee': 'TN',
    'Texas': 'TX',
    'Utah': 'UT',
    'Vermont': 'VT',
    'Virginia': 'VA',
    'Washington': 'WA',
    'West Virginia': 'WV',
    'Wisconsin': 'WI',
    'Wyoming': 'WY',
    'District of Columbia':'DC'
}
# Now let's total the deaths per state, keyed by the abbreviations above.
abbr_shootings = {}
for i, r in data.iterrows():
    state = us_state_abbrev[r['state']]
    # we will extract the number of deaths from each tuple and then add it to the current value of the new dictionary
    # gets the abbreviation from the above dictionary and puts it as the new key. 
    if state in abbr_shootings:
        abbr_shootings[state] += r['n_killed']
    else:
        abbr_shootings[state] = r['n_killed']

# Build the two-column DataFrame (state, deaths) for the map below directly
# from the dictionary; this avoids the deprecated Series.set_value calls.
df_state = pd.DataFrame(list(abbr_shootings.items()))


# Reads the github json file of the US states to make the map below 
import urllib.request, json 
with urllib.request.urlopen("https://raw.githubusercontent.com/python-visualization/folium/master/examples/data/us-states.json") as url:
    geodata = json.loads(url.read().decode())
map_heat = folium.Map(location=[37.8, -102], zoom_start=4)
# Creates a choropleth of deaths by state; the darker the shade, the more deaths.
# folium.Choropleth replaces the deprecated Map.choropleth method.
folium.Choropleth(geo_data=geodata, data=df_state,
             columns=[0, 1],
             key_on='feature.id',
             fill_color='OrRd', fill_opacity=0.7, line_opacity=0.2,
             legend_name='Deaths (Count)').add_to(map_heat)
map_heat
Out[35]:

Observations

Interactive Map

The interactive map shows us that the gun violence incidents from 2013-2018 are primarily concentrated on the East and West coasts, with most of the incidents occurring on the East coast. Furthermore, upon closer inspection of the clustered regions, shootings are concentrated in cities.

Heatmap

Our heatmap shows a clear concentration of gun deaths in California and Texas, with Florida and some of the northeast states slightly behind. Note that these are raw counts, so the most populous states naturally rank high. Policy makers can take this into account when examining the effectiveness of various laws in different states, and when deciding which states require the most work to bring these numbers down.

3. Hypothesis Testing and Linear Regression

Now we discuss the next step in the data science pipeline: hypothesis testing and linear regression. Statistics provides numerous tools for taking large sets of data and making empirically grounded statements about them. One of those tools is hypothesis testing.

Hypothesis Testing

Hypothesis Testing is used to evaluate a statistical claim. You can learn more about Hypothesis Testing in this article, which covers the different ways of performing hypothesis tests.

Steps in Hypothesis Testing

  1. State the NULL Hypothesis
  2. State the Alternative Hypothesis
  3. Calculate a test statistic
  4. Choose the acceptance region and rejection regions
  5. Based on steps 3 and 4, draw a conclusion about the null hypothesis.

Now we will go through each step for our data

Step 1 - State the NULL Hypothesis

This is the step where we state a claim opposite to the one we are actually trying to make. The convention in hypothesis testing is to assume the status quo (the null hypothesis) and reject it only when the data provide strong evidence against it. Gun violence, and mass shootings in particular, have drawn increasing attention in recent years, as coverage such as this article from the Guardian shows.

Our claim is that deaths from mass shootings are increasing year over year, so the null hypothesis is the opposite of that claim.

NULL Hypothesis: Over the course of the years, the number of deaths from mass shootings is not increasing every year.

Step 2 - State the Alternative Hypothesis

As we said earlier, this is the hypothesis we hope to support, because accepting it would disrupt the status quo and bring new information to light.

Alternative Hypothesis: Over the course of the years, the number of deaths from mass shootings is increasing every year.

Step 3 - Calculate a test statistic

Here we calculate a p-value: the probability of obtaining results at least as extreme as our sample if the null hypothesis were true. A large p-value means the data are consistent with the null hypothesis, so we would not reject it. In the next step we will set the threshold below which we reject the NULL Hypothesis.

You can find more about P-Values here: P-Values

To calculate the P-Value, we will use the T-Test, which is one of the ways to calculate a P-Value. Using the stats module from the scipy library, we will calculate the T-Test on the year and harmed values for each set of crimes.

You can find more about T-Test here: T-Test

You can find more about stats here: stats
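As a standalone sketch of what `stats.ttest_ind` returns (using synthetic data, not the shootings dataset), the function yields a test statistic and a two-sided p-value:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
a = rng.normal(0.0, 1.0, 200)  # sample drawn around mean 0
b = rng.normal(0.5, 1.0, 200)  # sample drawn around mean 0.5

# ttest_ind returns the test statistic and the two-sided p-value
t_stat, p_value = stats.ttest_ind(a, b)
print(p_value < 0.05)  # True: the means plausibly differ
```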

Step 4 - Choose the acceptance region and rejection regions

In this step, we choose a threshold for our P-Values. P-Values have a vast range, and it is imperative to specify for which values we should accept or reject the null hypothesis. The lower the threshold, the stronger the evidence required to reject the NULL Hypothesis. In our case, we reject the null hypothesis if the p-value is less than or equal to 0.005; this is stricter than the conventional 0.05 cutoff, so we only reject when the evidence is strong.

You can find more about this here: Accepting and Rejection
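The decision rule itself is simple enough to sketch in a few lines (the threshold below mirrors the 0.005 used in this tutorial):

```python
ALPHA = 0.005  # rejection threshold used in this tutorial

def decide(p_value, alpha=ALPHA):
    # Reject the null hypothesis only when the p-value falls at or below alpha
    return "reject H0" if p_value <= alpha else "fail to reject H0"

print(decide(0.0052))  # fail to reject H0 (just above the threshold)
print(decide(0.0004))  # reject H0
```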

Step 5 - Based on steps 3 and 4, draw a conclusion about the null hypothesis.

Here is the first part of our last step. We omit the year 2018, as it is not yet complete, so that the pattern in the shootings is clearer. In this first part of our analysis, we use linear regression. In statistics, both curve fitting and linear regression are ways of explaining (fitting) data; each method has its own strengths and weaknesses, and each is suitable for different conditions.

Linear regression is well suited to cases where we do not have much data and need to explain or extrapolate a trend. In our case, we have only a few years to work with, but ample crimes per year. Linear regression also lets us predict future values, which we use to estimate the percent increase over 2017's totals.

We will not be using curve fitting here: even though a curve may fit the points more tightly, we want linear regression's ability to predict and extrapolate values, which lets us compare the rate of increase for shootings in general against mass shootings.

To learn more about Curve-Fitting: Curve-Fitting
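For contrast, here is a minimal curve-fitting sketch using NumPy's `polyfit` on toy quadratic data (made-up values, not from the dataset); a degree-2 fit recovers the curve exactly, while a straight line cannot:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = x ** 2  # toy data that is perfectly quadratic

line = np.polyfit(x, y, 1)   # straight-line fit (what linear regression does)
curve = np.polyfit(x, y, 2)  # curve fit with a degree-2 polynomial

# The quadratic fit recovers y = x^2 (coefficients ~ [1, 0, 0])
print(np.allclose(curve, [1.0, 0.0, 0.0]))  # True
```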

To learn more about Linear Regression: Linear Regression
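And here is a minimal sketch of the extrapolation idea with scikit-learn (hypothetical yearly totals, not real figures): fit a line over the known years and predict the next one:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical yearly totals; scikit-learn expects a 2D feature array
years = np.array([[2013], [2014], [2015], [2016], [2017]])
totals = np.array([10.0, 12.0, 15.0, 17.0, 20.0])

reg = LinearRegression().fit(years, totals)
prediction = reg.predict([[2018]])[0]  # extrapolate one year ahead
print(round(prediction, 1))  # 22.3
```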

In the code below, we are building a new dataframe that will store the number of people who were harmed every year due to gun violence.

Analysis of Dataframe with Total Harm per Year for all Shootings

In [41]:
years = [2013, 2014, 2015, 2016, 2017]
#store the years
#here we are taking all shootings; 2018 is omitted because the year is incomplete
total_harmed = [data[data['year'] == y]['harmed'].sum() for y in years]
#stores the number of people harmed each year
new_data = pd.DataFrame(data=years, columns = ['years'])
new_data['harmed'] = total_harmed
#add to the current dataframe
#add to the current dataframe
In [42]:
print(stats.ttest_ind(new_data['years'], new_data['harmed']))
# Reshape the years into single-element lists, since scikit-learn expects a 2D feature array.
years = []
for year in new_data['years']:
    years.append([year])
new_data['harmed'].groupby(new_data['years']).sum().plot()
reg=linear_model.LinearRegression()
reg.fit(years,new_data['harmed'])
# m is the slope of the regression line, b the intercept
m=reg.coef_[0]
b=reg.intercept_
predicted_values = reg.predict(years)
plt.plot(years, predicted_values, 'b')
plt.gcf().set_size_inches(10, 8)
plt.show()
Ttest_indResult(statistic=-3.8053548542023803, pvalue=0.005196880885259166)
In [38]:
rate_all = reg.predict([[2018]]) - total_harmed[4]
rate_all = (rate_all/total_harmed[4]) * 100
print(rate_all)
[38.07525858]

Analysis of Dataframe with Total Harm per Year for all Mass Shootings

In [39]:
years = [2013, 2014, 2015, 2016, 2017]
# this is similar to before, but now we are only accounting for mass shootings.
mass = data[data['incident_characteristics'] == "Mass Shooting(4+ Deaths/Injuries)"]
total_harmed = [mass[mass['year'] == y]['harmed'].sum() for y in years]
new_data = pd.DataFrame(data=years, columns = ['years'])
new_data['harmed'] = total_harmed

# Conducting the T-Test 
print(stats.ttest_ind(new_data['years'], new_data['harmed']))
# Reshape the years into single-element lists, since scikit-learn expects a 2D feature array.
years = []
for year in new_data['years']:
    years.append([year])
new_data['harmed'].groupby(new_data['years']).sum().plot()
reg=linear_model.LinearRegression()
reg.fit(years,new_data['harmed'])
# m is the slope of the regression line, b the intercept
m=reg.coef_[0]
b=reg.intercept_
predicted_values = reg.predict(years)
plt.plot(years, predicted_values, 'b')
plt.gcf().set_size_inches(10, 8)
plt.show()
Ttest_indResult(statistic=3.0292967237468047, pvalue=0.016326873458966606)
In [40]:
rate_mass = reg.predict([[2018]]) - total_harmed[4]
rate_mass = (rate_mass/total_harmed[4]) * 100
print(rate_mass)
[20.25876941]

Here we have analyzed the dataframe containing only mass shootings. This is our main target dataframe, as it is the one our T-Test addresses. The T-Test gives a p-value of approximately 0.016, which is above our 0.005 threshold, so we do not reject our NULL Hypothesis. This result is further reinforced by the stark difference in the rate of increase between all shootings and mass shootings alone.

This also suggests that, while mass shootings rightly cause alarm, some commentary induces fear by overstating how quickly they are growing. Our results show that the recent resurgence in mass shootings is a call to action, but the rate of increase is not erratic: it is high, around 20%, yet smaller than the 38% increase predicted for all shootings.

Overall, these results suggest that while mass shootings are becoming a danger to our society, the trend is not accelerating faster than shootings in general. That said, the small gap between our p-value and the rejection threshold suggests that, with more data in the future, we may come to reject the NULL Hypothesis.

4. Informing Policy and Conclusion

This marks the end of our data science pipeline, and the point where we discuss the insights gained from this tutorial. Through this tutorial, we collected, tidied, and analyzed the data, then defined a hypothesis and tested whether it should be rejected. Our analysis showed that the predicted percent increase for shootings in general was higher than for mass shootings.

This tells us that while mass shooting deaths are certainly growing across America, they are not growing as quickly as shooting deaths in general. As a result, our policy making should focus not only on reducing mass shootings, but also on reducing shooting-related incidents overall.

Through our data lifecycle we showed how serious a problem gun violence is in America. Our interactive map and heatmap showed that gun violence is spread throughout the country but concentrated on the coasts. Our linear regression model showed a rise in gun-violence-related deaths and injuries, with our predictive model forecasting a 38% increase in total gun-violence-related injuries/deaths and a 20% increase in mass-shooting-related injuries/deaths.